Utilizing Multi-Field Text Features for Efficient Email Spam Filtering

Authors

  • Wuying Liu
  • Ting Wang
Abstract

Large-scale spam email causes a serious waste of time and resources. This paper investigates the text features of email documents and the feature noise among multi-field texts, and observes that feature strings within each text field follow a power law distribution. Based on this observation, we propose an efficient filtering approach that combines a compound weight method with a lightweight field text classification algorithm. The compound weight method considers both the historical classifying ability of each field classifier and the classifying contribution of each text field in the email currently being classified. The lightweight field text classification algorithm directly calculates the arithmetic mean of multiple conditional probabilities predicted from feature strings, using a string-frequency index that stores the labeled emails. Owing to the power law distribution, the string-frequency index structure can be compressed by random sampling, which greatly reduces storage space. Experimental results on the TREC spam track show that the proposed approach completes the filtering task at low space cost and high speed, and that its overall performance, measured by 1-ROCA, is better than the best result among the participants in the trec07p evaluation.
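
The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: a per-field string-frequency index whose score is the arithmetic mean of per-string conditional probabilities, and a weighted combination of field scores. Everything concrete here is an assumption made for illustration: the names `overlapping_ngrams`, `FieldClassifier`, and `combine`, the use of character 4-grams as feature strings, the neutral score of 0.5 for unseen text, and the choice of "history weight times number of known strings" as a stand-in for the compound weight.

```python
from collections import defaultdict


def overlapping_ngrams(text, n=4):
    """Split text into overlapping character n-grams (assumed feature-string scheme)."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class FieldClassifier:
    """Hypothetical string-frequency index for one text field (e.g. subject or body)."""

    def __init__(self, n=4):
        self.n = n
        # feature string -> [count in spam emails, count in ham emails]
        self.index = defaultdict(lambda: [0, 0])

    def train(self, text, is_spam):
        """Record each distinct feature string of a labeled email once."""
        for s in set(overlapping_ngrams(text, self.n)):
            self.index[s][0 if is_spam else 1] += 1

    def score(self, text):
        """Return (mean of P(spam | s) over known strings, number of known strings)."""
        probs = []
        for s in set(overlapping_ngrams(text, self.n)):
            if s in self.index:  # membership test does not create an entry
                spam, ham = self.index[s]
                probs.append(spam / (spam + ham))
        if not probs:
            return 0.5, 0  # assumed neutral score when nothing is known
        return sum(probs) / len(probs), len(probs)


def combine(field_scores, history_weights):
    """Weighted average of field scores.

    field_scores: field name -> (score, number of known strings)
    history_weights: field name -> assumed past accuracy of that field classifier
    The weight is history * known-string count, an illustrative stand-in
    for the compound weight described in the abstract.
    """
    weighted, total = 0.0, 0.0
    for field, (score, n_known) in field_scores.items():
        w = history_weights.get(field, 1.0) * n_known
        weighted += w * score
        total += w
    return weighted / total if total > 0 else 0.5
```

In this sketch, one `FieldClassifier` would be trained per field, each field of a new email scored separately, and the results passed to `combine`. A field with no known strings gets weight zero, so it does not dilute the final score.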

Similar resources

Active Multi-Field Learning for Spam Filtering

Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem and proposes a universal approach to fight various spam messages. The proposed active multi-field learning approach is based on two observations: 1) obtaining a label is costly for a real-world spam filter, which suggests an active learning idea; and 2) Different messag...

Embedded-Text Detection and Its Application to Anti-Spam Filtering

Ching-Tung Wu. Embedded text in images usually carries important messages about the content. In the past, several algorithms have been proposed to detect text boxes in video frames. Previous work often followed a multi-step framework using a combination of image-analysis and machine-learning techniques. In this work, we propose a u...

Fusion of Text and Image Features: A New Approach to Image Spam Filtering

While enjoying the convenience of email communication, many users have also experienced annoying email spam. Although current spam detection approaches have gained a competitive edge against text-based email spam, they still face the challenge posed by image-based spam (image spam for short). Image spam normally includes embedded images that contain the spam messages in binary format rath...

Spam Filtering Based on Supervised Latent Semantic Features Extraction

Spam text is a universal phenomenon on the “open web”, including large-scale email systems and the growing number of blogs. Handling this information overload is becoming an increasingly challenging problem. A promising approach is the use of content-based filtering. In this paper, we focus on finding an effective dimension reduction method for email spam filtering; we apply a superv...

Evolutionary Symbiotic Feature Selection for Email Spam Detection

This work presents a symbiotic filtering approach enabling the exchange of relevant word features among different users in order to improve local anti-spam filters. The local spam filtering is based on a Content-Based Filtering strategy, where word frequencies are fed into a Naive Bayes learner. Several Evolutionary Algorithms are explored for feature selection, including the proposed symbiotic ...

Journal:
  • Int. J. Computational Intelligence Systems

Volume 5

Publication date: 2012